Voicing detection and pitch extraction system

ABSTRACT

Voicing detection and pitch extraction from speech sounds are achieved by means of an embodiment including a plurality of band-pass filters, each having sufficient bandwidth to pass at least two harmonics of the fundamental voice frequency, whereby each provides, for all voiced sounds, a signal in the form of a modulated wave whose envelope has a periodicity equal to that of the voice fundamental. This periodicity is further enhanced by means of a hard limiter. A frequency discriminator, whose input is provided by the band-pass filtered output of the limiter, provides a voltage waveform whose spectral energy distribution is utilized for discrimination between voiced and unvoiced sounds.

United States Patent 3,600,516

[72] Inventor: John H. King, Jr., Endwell, N.Y.
[21] Appl. No.: 829,414
[22] Filed: June 2, 1969
[45] Patented: Aug. 17, 1971
[73] Assignee: International Business Machines Corporation, Armonk, N.Y.

[54] VOICING DETECTION AND PITCH EXTRACTION SYSTEM
3 Claims, 2 Drawing Figs.

[52] U.S. Cl.: [illegible in source]
[51] Int. Cl.: G10l 1/04
[50] Field of Search: 179/1 AS, 1 [partly illegible], 15.55 R

[56] References Cited
UNITED STATES PATENTS
2,243,526  5/1941  Dudley  179/1 AS
2,340,364  2/1944  Bedford  179/1 AS
2,561,478  7/1951  Mitchell  179/1 AS
2,691,137  10/1954  Smith  179/1 AS
2,927,969  3/1960  Miller  179/1 AS
3,488,446  1/1970  Miller  179/1 AS

Primary Examiner: Kathleen H. Claffy
Assistant Examiner: Jon Bradford Leaheey
Attorneys: Hanifin and Jancin and Andrew Taras

[Drawing sheet, patented Aug. 17, 1971; inventor John H. King Jr. FIG. 1: block diagram showing the filter bank, rectifier bank, signal processing network, frequency discriminator, low-pass filter, and voice-no voice decision network. FIG. 2: detailed diagram showing the amplifier, filter bank, rectifier bank, band-pass filters, balanced modulator with oscillator, limiter, frequency discriminator, rectifier, and low-pass and high-pass filters.]

BACKGROUND OF THE INVENTION

The present invention is directed toward voicing detection and voice pitch extraction. The system embodiment employs a plurality of individual band-pass filters, each having a passband width greater than the highest fundamental frequency of the voice and sufficient to pass at least two harmonics. By virtue of the present invention, a measure of the speech waveform power spectrum periodicity, for all voiced sounds, issues as a modulated waveform having a periodicity equal to that of the voice fundamental. As a consequence, the periodicity of the speech waveform spectrum may be measured with a high degree of accuracy and reliability, because the outputs corresponding to voiced sounds are highly correlated, whereas random noise, background noises and nonvoiced speech sounds provide complex waveforms that have low correlations. In addition, the strength of the signal representative of the voice fundamental is greatly enhanced relative to other components of the modulated waveform by virtue of the signal processing properties of a hard limiter. A voltage level representing the voice fundamental pitch is also produced by means of a frequency discriminator. Finally, the spectral energy distribution of the output of the frequency discriminator is utilized for discriminating between voiced and nonvoiced sounds.

The invention is, accordingly, directed to overcoming the limitations of prior art systems by responding more accurately to a wider variety of speech signals, especially those in which rapid fluctuations in the overall spectral energy distribution occur due to changes in the vocal tract cavity during production of a connected sequence of vowel sounds. This capability is generally achieved by means of a system which obtains, by suitable nonlinear signal processing, a signal measuring the periodicity of the speech power spectrum, and which renders a substantially DC signal representation of the voice fundamental frequency that is substantially independent of the absolute amplitude of the voice signal.

OBJECTS

The primary object of the invention is to provide a voicing detection and voice fundamental pitch extraction system which has a higher degree of accuracy and reliability, and is less costly, than voicing detection and voice pitch tracking systems of the prior art.

Another object resides in the capabilities of the present invention to provide more meaningful data at lower costs than the prior art systems.

Another object resides in the provision of a highly sophisticated system which derives meaningful voicing data predicated upon detecting and measuring the speech waveform power spectrum periodicity.

Yet another object resides in the provision of a voicing detection system which provides a high degree of discrimination between voiced and unvoiced sounds over a wide dynamic range.

Still another object resides in the provision of a voice fundamental pitch extraction system capable of operation over a wide dynamic range.

The foregoing and other objects, features and advantages of the invention will be apparent from the following more detailed description of the preferred embodiment of the invention as illustrated in the accompanying drawings.

In the drawings:

FIG. 1 is a schematic representation showing the arrangement of the principal means constituting the voicing detection and pitch extraction system.

FIG. 2 is a detailed drawing of the voicing detection and pitch extraction system.

A general understanding of the present invention may now be had from FIG. 1, which shows a schematic arrangement of the principal means constituting the voicing detection and [text missing in source] individual full-wave rectifiers in a rectifier bank 4. The rectified outputs from the rectifier bank 4 are transmitted to a signal processing network 14 by means of lines 4-1a through 4-15a. By means of the signal processing network 14 the speech waveform is reduced to a substantially pure sinusoidal waveform, the frequency of which is proportionally related to the fundamental pitch of the input speech waveform whenever the latter results from voiced speech. The output signal from the processing network 14 is passed on by line 10-1 to a frequency discriminator 11, which translates the instantaneous frequency of the waveform on line 10-1 into a substantially DC signal, the level of which is indicative of the instantaneous frequency of the voice fundamental pitch during intervals of voiced speech. During intervals of no speech, or of unvoiced speech typical of consonant and fricative sounds, the outputs of the signal processing network 14 and the frequency discriminator 11 are essentially random waveforms. The presence and absence of these random waveforms are utilized to discriminate between intervals of voiced and unvoiced speech as follows.

The output of the frequency discriminator is passed by means of line 11-2 to a voice-no voice decision network 13. The output line 13-1 of this network provides a DC level of a given value when the input line 11-2 issues a pattern of random waveforms, which in effect represents the presence of unvoiced sounds in the speech waveform, or a DC level of another value when the input line 11-2 issues said substantially DC signal, which in effect represents the presence of voiced sounds in the speech waveform.

During intervals of voiced speech, the output from the frequency discriminator is substantially a DC level which rises and falls in response to relatively slow variations of the voice pitch. This output is passed on to the input of a low-pass filter 12 by means of line 11-1, the voltage V on line 12-1 representing the output of the filter. The function of this low-pass filter is to remove any small and rapid fluctuations superimposed on the slowly varying level. Thus the voltage V represents the instantaneous fundamental pitch of the voice during intervals of voiced speech.

To appreciate more fully the manner in which the bank of rectifiers and the signal processing network combine to extract signals representative of the voice fundamental pitch, reference is invited to FIG. 2, which shows the preferred embodiment in more detail. In FIG. 2, sound waves entering the system by way of the microphone 1 are converted into electrical waveform signals by means of the transducing properties of the microphone. These electrical waveform signals enter an amplifier 2 by means of line 1a and are amplified to a suitable level. The amplified signals enter, by way of line 2a, a filter bank 3 comprised of 15 individual filters, three of which are shown, namely, 3-1, 3-2 and 3-15. The filters employed are of the active network type, each having a bandwidth of approximately 300 Hz., the topmost filter 3-1 having a center frequency of 300 Hz. and the lowermost filter 3-15 a center frequency of 3,000 Hz. The filter bank 3 thus provides a plurality of orthogonal signal channels, controlled by the contiguously tuned filters, each providing, during an interval of voiced speech, a modulated waveform, the envelope of which has a period equal to the period of the fundamental of the voice. Modulation of the envelope of these waveforms results from the linear combination of waveforms constituting the harmonic components of the fundamental voice frequency.

The high degree of periodicity of the power spectrum of voiced speech waveforms results from the fact that the predominant mode of excitation of the vocal tract, during intervals of voiced speech, is by means of the glottal vibrator (vocal cords), which is known to possess a substantially sawtooth variation in the opening of the glottis. During these intervals the voiced sound waveforms are predominantly rich in harmonics that are integer multiples of the fundamental frequency, which for the male voice extends from about 70 Hz. to 150 Hz. in normal speech, and the meaningful spectrum of which extends from about 300 Hz. to somewhat beyond 3,000 Hz. Thus it is seen that, in this particular embodiment, a minimum of two harmonic components (for the highest fundamental pitch) will be spanned by the passband of each band-pass filter in the filter bank.
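The two-harmonic claim can be checked with a short calculation. The following is a minimal sketch, not part of the disclosed apparatus: it assumes the 15 center frequencies are uniformly spaced between 300 Hz. and 3,000 Hz. and treats each 300 Hz. passband as an ideal brick-wall band, neither of which is stated in the text. The function names are hypothetical.

```python
import math

FILTER_COUNT = 15
LOWEST_CENTER_HZ = 300.0
HIGHEST_CENTER_HZ = 3000.0
BANDWIDTH_HZ = 300.0


def center_frequencies():
    """Center frequencies, ASSUMING uniform spacing (not stated in the text)."""
    step = (HIGHEST_CENTER_HZ - LOWEST_CENTER_HZ) / (FILTER_COUNT - 1)
    return [LOWEST_CENTER_HZ + i * step for i in range(FILTER_COUNT)]


def harmonics_in_band(f0_hz, center_hz, bandwidth_hz=BANDWIDTH_HZ):
    """Count integer multiples of f0_hz inside an idealised brick-wall passband."""
    low = center_hz - bandwidth_hz / 2
    high = center_hz + bandwidth_hz / 2
    first = max(1, math.ceil(low / f0_hz))   # lowest harmonic number in the band
    last = math.floor(high / f0_hz)          # highest harmonic number in the band
    return max(0, last - first + 1)


def fewest_harmonics(f0_hz):
    """Smallest harmonic count found in any of the 15 channels."""
    return min(harmonics_in_band(f0_hz, c) for c in center_frequencies())


if __name__ == "__main__":
    for f0 in (70.0, 100.0, 150.0):
        print(f"f0 = {f0:5.1f} Hz: at least {fewest_harmonics(f0)} harmonics per channel")
```

Under these assumptions every channel spans at least two harmonics even at the 150 Hz. upper limit of the fundamental, since a 300 Hz. band always contains at least two multiples of any frequency of 150 Hz. or less.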

The modulated waveforms issuing from the band-pass filters 3-1, 3-2, through 3-15 are passed through full-wave rectifiers, or detectors, 4-1, 4-2, through 4-15, by way of lines 3-1a, 3-2a, through 3-15a. The function of the rectifiers is to provide a set of signals representative of the time variation of the envelopes of the signals issuing from the band-pass filters 3-1, 3-2, through 3-15. The outputs from the rectifiers are transmitted by way of lines 4-1a, 4-2a, through 4-15a to the fifteen inputs of the signal summing network, which includes DC blocking capacitors 5-1a, 5-2a, through 5-15a and resistors 5-1b, 5-2b, through 5-15b. The output of the signal summing network is passed, by means of line 5c, through a band-pass filter 6 having a passband extending from 70 Hz. to 250 Hz., which is more than sufficient to span the range of the male voice fundamental frequency. The output of band-pass filter 6 appearing on line 6-1, during intervals of voiced speech, reflects a fundamental frequency, possibly together with second and third harmonics weaker than the fundamental.
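The effect of the rectifier can be illustrated numerically. The sketch below is an idealisation under stated assumptions: one channel is modelled as two adjacent harmonics of equal amplitude, the fundamental is a hypothetical 120 Hz., and the sample rate, duration and function names are illustrative choices rather than values from the text. It measures the component at the fundamental before and after full-wave rectification.

```python
import math

SAMPLE_RATE_HZ = 8000
SAMPLE_COUNT = 800        # 0.1 s of signal
F0_HZ = 120.0             # hypothetical voice fundamental


def tone_magnitude(samples, freq_hz, sample_rate_hz=SAMPLE_RATE_HZ):
    """Amplitude of the component at freq_hz, from a single-bin DFT."""
    re = 0.0
    im = 0.0
    for n, s in enumerate(samples):
        angle = 2 * math.pi * freq_hz * n / sample_rate_hz
        re += s * math.cos(angle)
        im -= s * math.sin(angle)
    return 2 * math.hypot(re, im) / len(samples)


def two_harmonic_channel(f0_hz, lower_harmonic):
    """Idealised band-pass output: two adjacent harmonics of equal amplitude."""
    out = []
    for n in range(SAMPLE_COUNT):
        t = n / SAMPLE_RATE_HZ
        out.append(math.cos(2 * math.pi * lower_harmonic * f0_hz * t)
                   + math.cos(2 * math.pi * (lower_harmonic + 1) * f0_hz * t))
    return out


channel = two_harmonic_channel(F0_HZ, 10)    # harmonics at 1,200 Hz and 1,320 Hz
rectified = [abs(s) for s in channel]        # full-wave rectifier

before = tone_magnitude(channel, F0_HZ)      # essentially zero: no energy at f0
after = tone_magnitude(rectified, F0_HZ)     # substantial: envelope repeats at f0

if __name__ == "__main__":
    print(f"component at f0 before rectifying: {before:.6f}")
    print(f"component at f0 after rectifying:  {after:.6f}")
```

The channel signal itself carries no energy at the fundamental; it appears only after the nonlinear rectification, which is why a 70 Hz. to 250 Hz. band-pass filter following the summing network can recover it.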

During intervals of unvoiced speech the output of band-pass filter 6 is essentially a band of random noise having significant energy confined to the frequency range 70 Hz. to 250 Hz.

The signal issuing from the band-pass filter 6 is passed on to a balanced modulator 7 by way of line 6-1. The function of the balanced modulator 7 is to shift the frequency range of the signal issuing from band-pass filter 6 to a considerably higher range of frequencies, thereby yielding a frequency-translated signal whose percentage bandwidth is quite narrow with respect to its expected frequency range. The desired action is achieved by driving the balanced modulator 7 with a reference signal provided by local oscillator 6a, connected by means of line 6a-1, the local oscillator frequency being typically 15 kHz. The output of the balanced modulator consists of a double sideband suppressed carrier modulated waveform, which is passed on to band-pass filter 8 by means of line 7-1. The function of band-pass filter 8 is to select either the upper or the lower sideband signal and to reject the other sideband signal, together with any residual carrier signal at the local oscillator frequency of 15 kHz. which may be present in the output from the balanced modulator due to slight imbalance. Typically, the band-pass filter 8 would have a passband extending from 15,070 Hz. to 15,250 Hz. when designed to select the upper sideband signal. The signal output of band-pass filter 8 is substantially a frequency-translated version of the signal issuing from band-pass filter 6, but with the distinguishing property of being narrow band. This signal is passed on to a hard limiter 9 by way of line 8-1. The output of the limiter 9 is passed on, by way of line 9-1, to a second band-pass filter 10 having essentially the same passband as the filter 8.
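The reduction in percentage bandwidth follows from simple arithmetic. The sketch below takes the upper sideband edges as the 15 kHz. oscillator frequency plus 70 Hz. and 250 Hz., and defines percentage bandwidth as the bandwidth divided by the arithmetic center of the band; that definition and the function names are assumptions made for illustration.

```python
LOCAL_OSCILLATOR_HZ = 15_000.0
BASEBAND_LOW_HZ = 70.0
BASEBAND_HIGH_HZ = 250.0


def percentage_bandwidth(low_hz, high_hz):
    """Bandwidth as a percentage of the band's arithmetic center frequency."""
    center_hz = (low_hz + high_hz) / 2
    return 100.0 * (high_hz - low_hz) / center_hz


def upper_sideband(low_hz, high_hz, oscillator_hz=LOCAL_OSCILLATOR_HZ):
    """Band edges after translation to the upper sideband of the oscillator."""
    return oscillator_hz + low_hz, oscillator_hz + high_hz


baseband_pct = percentage_bandwidth(BASEBAND_LOW_HZ, BASEBAND_HIGH_HZ)
usb_low_hz, usb_high_hz = upper_sideband(BASEBAND_LOW_HZ, BASEBAND_HIGH_HZ)
translated_pct = percentage_bandwidth(usb_low_hz, usb_high_hz)

if __name__ == "__main__":
    print(f"baseband:   {BASEBAND_LOW_HZ:.0f}-{BASEBAND_HIGH_HZ:.0f} Hz, {baseband_pct:.1f}%")
    print(f"translated: {usb_low_hz:.0f}-{usb_high_hz:.0f} Hz, {translated_pct:.2f}%")
```

The absolute bandwidth stays at 180 Hz., but relative to the band center it falls from more than 100 percent to a little over 1 percent, which is the narrow-band condition the subsequent limiter and band-pass filter rely on.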

The combined action of limiter 9 and band-pass filter 10 is such that the signal issuing from the band-pass filter 10 will be of substantially constant amplitude, with an average frequency linearly related to the voice fundamental, during intervals of voiced speech. During intervals of unvoiced speech the output of band-pass filter 10 is substantially a random noise signal with a significant energy spectrum extending from roughly 15,070 Hz. to 15,250 Hz. The desirable signal processing property just described is the result of the signal capture phenomenon exhibited by a hard limiting process followed by band-pass filtering.
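One aspect of the capture phenomenon, the suppression of a weaker component by a stronger one, can be sketched numerically. The example below is an illustration under assumptions: the two tone frequencies, the 2:1 amplitude ratio, the sample rate and the function names are all hypothetical, and a single-bin DFT stands in for measuring what a band-pass filter would pass at each frequency.

```python
import math

SAMPLE_RATE_HZ = 200_000
SAMPLE_COUNT = 20_000           # 0.1 s of signal
STRONG_HZ = 15_100.0            # hypothetical translated fundamental
WEAK_HZ = 15_200.0              # hypothetical weaker interfering component
WEAK_TO_STRONG_IN = 0.5         # amplitude ratio at the limiter input


def tone_magnitude(samples, freq_hz, sample_rate_hz=SAMPLE_RATE_HZ):
    """Amplitude of the component at freq_hz, from a single-bin DFT."""
    re = 0.0
    im = 0.0
    for n, s in enumerate(samples):
        angle = 2 * math.pi * freq_hz * n / sample_rate_hz
        re += s * math.cos(angle)
        im -= s * math.sin(angle)
    return 2 * math.hypot(re, im) / len(samples)


def hard_limit(samples):
    """Ideal hard limiter: keep only the sign of each sample."""
    return [1.0 if s >= 0.0 else -1.0 for s in samples]


signal = [
    math.cos(2 * math.pi * STRONG_HZ * n / SAMPLE_RATE_HZ)
    + WEAK_TO_STRONG_IN * math.cos(2 * math.pi * WEAK_HZ * n / SAMPLE_RATE_HZ)
    for n in range(SAMPLE_COUNT)
]
limited = hard_limit(signal)

strong_out = tone_magnitude(limited, STRONG_HZ)
weak_out = tone_magnitude(limited, WEAK_HZ)
weak_to_strong_out = weak_out / strong_out

if __name__ == "__main__":
    print(f"weak/strong ratio before limiting: {WEAK_TO_STRONG_IN:.3f}")
    print(f"weak/strong ratio after limiting:  {weak_to_strong_out:.3f}")
```

The limiter output has constant magnitude regardless of input level, and the weaker tone emerges further below the stronger one than it went in, which is consistent with the enhancement of the fundamental described above.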

The output of filter 10 is passed on to a frequency discriminator 11 by way of line 10-1. One function of discriminator 11 is to detect the quasi-instantaneous frequency of the signal issuing from filter 10 during intervals of voiced speech. During intervals of unvoiced speech the output of the frequency discriminator 11 is substantially a random noise signal with significant energy content extending from about 0 Hz. to around 180 Hz. The output of the frequency discriminator is passed to a low-pass filter 12 by way of line 11-1. The low-pass filter 12, by virtue of having a cutoff frequency of 15 Hz., serves to remove minor high frequency fluctuations from the output of the frequency discriminator, so that the output V of low-pass filter 12, on line 12-1, provides a voltage level representation of the quasi-instantaneous (short term average) voice fundamental frequency during intervals of voiced speech.
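The text does not say what kind of discriminator circuit is used. As one possible illustration, the sketch below substitutes a zero-crossing estimator: it measures the frequency of a constant-amplitude translated signal from the spacing of its sign changes, then subtracts the 15 kHz. oscillator frequency to recover the pitch. Upper-sideband translation, the sample rate, the test pitch and the function names are all assumptions.

```python
import math

SAMPLE_RATE_HZ = 200_000
SAMPLE_COUNT = 20_000               # 0.1 s observation window
LOCAL_OSCILLATOR_HZ = 15_000.0


def zero_crossing_frequency(samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Estimate frequency from the spacing between the first and last sign change."""
    crossings = [
        n for n in range(1, len(samples))
        if (samples[n - 1] < 0.0) != (samples[n] < 0.0)
    ]
    if len(crossings) < 2:
        raise ValueError("need at least two zero crossings")
    span_s = (crossings[-1] - crossings[0]) / sample_rate_hz
    half_cycles = len(crossings) - 1    # one half cycle between adjacent crossings
    return half_cycles / (2.0 * span_s)


def pitch_from_translated(samples):
    """Undo the assumed upper-sideband translation to recover the pitch in Hz."""
    return zero_crossing_frequency(samples) - LOCAL_OSCILLATOR_HZ


TRUE_PITCH_HZ = 120.0               # hypothetical voice fundamental
translated = [
    math.sin(2 * math.pi * (LOCAL_OSCILLATOR_HZ + TRUE_PITCH_HZ) * n / SAMPLE_RATE_HZ + 0.3)
    for n in range(SAMPLE_COUNT)
]
estimated_pitch_hz = pitch_from_translated(translated)

if __name__ == "__main__":
    print(f"estimated pitch: {estimated_pitch_hz:.1f} Hz")
```

Averaging over a 0.1 second window plays a role loosely comparable to that of the 15 Hz. low-pass filter 12, in that both trade time resolution for a smoother pitch reading.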

During the presence of speech sounds, or any other sounds for that matter, that are not harmonic in character, but instead have a structure more akin to random noise, the output signal from the frequency discriminator is of random character. The distinct difference in the character of the signals issuing from the frequency discriminator during intervals of voiced speech and intervals of unvoiced speech, is utilized by the voice-no voice decision network 13 to provide appropriate outputs in the following manner.

The output from the frequency discriminator 11 is passed on, by way of line 11-2, to a high-pass filter 17 having a cutoff frequency of about 50 Hz. During voiced speech the output signal from the frequency discriminator has a spectral energy distribution confined to frequencies below 50 Hz., so that the output of high-pass filter 17 is substantially zero. During intervals of unvoiced speech the signal output from the frequency discriminator is of substantial amplitude and of random character, with a spectral energy distribution concentrated in the frequency range above 50 Hz.

The output of high-pass filter 17 is passed on to rectifier 15 by way of line 17-1, and the output from rectifier 15 is then passed on, by way of line 15-1, to low-pass filter 16 having a cutoff frequency of 15 Hz. From the foregoing it is seen that the output from low-pass filter 16 will be substantially a DC level during intervals of unvoiced speech that is different from the DC level output during intervals of voiced speech. These different DC signal levels are utilized in a decision rendering function by passing the output of low-pass filter 16 to a threshold detector circuit 21, by way of line 16-1. The threshold detector circuit, which effects the actual decision, is comprised of a high gain differential DC amplifier 20, input resistor 18, and positive feedback resistor 19. The detection threshold (i.e., the decision threshold) is controlled by the level of the reference voltage applied to the positive input of the differential amplifier 20. The actual voice or no voice decision is indicated by which of two possible signal levels exists at the output of the threshold detector circuit on line 13-1.
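A comparator with positive feedback generally behaves as a two-threshold (hysteresis) detector, although the text does not say so explicitly. The sketch below models the decision stage on that assumption: the input is the smoothed level from the low-pass filter, a high level indicates unvoiced speech, and the threshold values and function name are hypothetical.

```python
def voicing_decisions(levels, upper_threshold, lower_threshold, start_voiced=True):
    """Two-threshold detector: True means voiced, False means unvoiced.

    A level above upper_threshold switches the decision to unvoiced; the
    decision returns to voiced only once the level drops below lower_threshold.
    """
    if lower_threshold > upper_threshold:
        raise ValueError("lower_threshold must not exceed upper_threshold")
    voiced = start_voiced
    decisions = []
    for level in levels:
        if voiced and level > upper_threshold:
            voiced = False      # noise energy above 50 Hz: unvoiced
        elif not voiced and level < lower_threshold:
            voiced = True       # energy has died away: voiced
        decisions.append(voiced)
    return decisions


# Hypothetical smoothed levels: quiet, rising through both thresholds, then falling.
levels = [0.1, 0.4, 0.7, 0.5, 0.4, 0.2, 0.1]
decisions = voicing_decisions(levels, upper_threshold=0.6, lower_threshold=0.3)

if __name__ == "__main__":
    for level, voiced in zip(levels, decisions):
        print(f"level {level:.1f}: {'voiced' if voiced else 'unvoiced'}")
```

The gap between the two thresholds keeps the decision from chattering when the smoothed level hovers near a single decision point.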

While the invention has been particularly shown and described with reference to a preferred embodiment thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

What I claim is:

1. A voicing detection apparatus for detecting voiced sounds present in speech waveforms comprising:

a filter bank constituted of a plurality of contiguously tuned band-pass filters responsive to said speech waveforms to provide modulated waveforms, the periodicity of the envelope of each of the latter waveforms corresponding to the periodicity of the voice fundamental;

a plurality of detectors each responsive to a specific modulated waveform to provide appropriate linearly summed time variant signals;

a processing network responsive to said time variant signals to provide a substantially pure sinusoidal waveform whose frequency is proportionally related to the fundamental pitch of voiced sounds in the speech waveforms,

said processing network comprising a summing network connected to the output of said detectors, a broad band-pass filter connected to the output of said summing network, and a modulator, connected to the output of said broad band-pass filter, to provide a frequency translated signal whose percentage bandwidth is reduced in relation to its expected frequency range;

a frequency discriminator, interconnected to the output of said network, providing a substantially DC signal output, the level of which is a function of the instantaneous voice pitch during voiced speech, and

a decision network interconnected to the frequency discriminator output and providing a DC signal level of one value in response to signals representing unvoiced sounds and a DC signal level of another value in response to signals representing voiced sounds.

2. The voicing detection apparatus as in claim 1 wherein said processing network further includes a limiter for enhancing the ratio of the voice fundamental frequency to the harmonic frequencies, and a narrow band-pass filter connected to said limiter for rejecting unwanted higher harmonic frequencies.

3. The voicing detection apparatus as in claim 1 wherein said contiguously tuned band-pass filters are each of the active network type having a passband width of approximately 300 Hz., and said filter bank is adapted to accommodate a spectral bandwidth extending from approximately 300 Hz. to approximately 3,000 Hz. 
